Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #14655 +/- ##
==========================================
- Coverage 32.36% 30.86% -1.51%
==========================================
Files 3097 3099 +2
Lines 210976 211013 +37
Branches 38232 37574 -658
==========================================
- Hits 68280 65124 -3156
- Misses 142696 145889 +3193
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
698cd42 to
2ea4f68
Compare
opencti-platform/opencti-graphql/src/modules/internal/document/document.ts
Outdated
Show resolved
Hide resolved
3635ce9 to
c4b70ac
Compare
|
Can you please create a public issue in OpenCTI repo and link it https://github.com/OpenCTI-Platform/opencti/issues |
1f89b8f to
c97d2f2
Compare
Done. I 'll make sure the correct issue number is in the commit message when squashing too 👍 . |
c97d2f2 to
f8aaca1
Compare
| typeof ATTACHMENT_MAPPINGS[number]['name'] | ||
| > extends never | ||
| ? MappingDefinition<BasicStoreAttribute>[] | ||
| : 'Make sure ATTACHMENT_MAPPINGS defines one mapping for each AttachmentProcessorExtractedProp'; |
There was a problem hiding this comment.
I don't understand well this part with this type check, could you explain it please? why is it needed?
There was a problem hiding this comment.
So basically the issue with this exception came from not having defined a mapping for each field extracted from the document.
The solution I suggest is configuring the attachment ingest thing to extract only the fields we want.
I wanted a way to statically make sure those two lists (the mappings and the ingest process config) remain in sync so I came up with this "hack" that uses an intermediate variable TYPE_CHECKED_ATTACHMENT_MAPPINGS on which we assert one of two types:
- either its correct type
MappingDefinition<BasicStoreAttribute>[], if no mappings are missing (thenevercase) - either a string literal actually explaining the problem
While this was fun, it's not maybe the most clear way to do that (:, so I pushed another refactor commit to create an assertType utility that's a noop at runtime but is useful for compile-time type strict equality checks. Let me know if that's better ?! With that method I can keep the mappings definition in place.
…ileIndexManager (#89) The issue arose because of missing index mappings in the attachment sub-document: we use an Elasticsearch pipeline processor for attachments that extracts fields for us. By default this processor extracts all the fields it can: https://www.elastic.co/guide/en/elasticsearch/reference/8.19/attachment.html#attachment-fields. This commit specifies which fields to extract: for those enforce an index mapping def
01100d7 to
ea8f73f
Compare
ea8f73f to
bf5d902
Compare
There was a problem hiding this comment.
Pull request overview
This PR addresses strict_dynamic_mapping_exception errors during file indexing by constraining which fields the Elasticsearch/OpenSearch attachment ingest processor is allowed to extract, aligning ingestion with the existing strict index mappings.
Changes:
- Configure the attachment ingest pipeline (
properties) to only extract fields that are mapped (separately for Elasticsearch vs OpenSearch). - Add an integration test indexing a PDF containing metadata that would previously trigger strict mapping failures.
- Add an OpenSearch dev Docker image build (with
ingest-attachmentplugin) and update dependent test expectations.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| opencti-platform/opencti-graphql/tests/03-integration/04-manager/retentionManager-test.ts | Updates expected file counts due to the new indexed test file. |
| opencti-platform/opencti-graphql/tests/03-integration/01-database/index-file-test.js | Adds an integration test to validate indexing succeeds with “unhandled” PDF metadata. |
| opencti-platform/opencti-graphql/src/utils/type-utils.ts | Adds TS utility types/helpers used for compile-time type assertions. |
| opencti-platform/opencti-graphql/src/modules/internal/document/document.ts | Adds a compile-time check to keep attachment mappings aligned with extracted props. |
| opencti-platform/opencti-graphql/src/database/engine.ts | Restricts ingest-attachment extracted properties for ES/OS pipelines. |
| opencti-platform/opencti-graphql/src/database/attachment-processor-props.ts | Defines the explicit extracted-property allowlists (ES vs OpenSearch) + shared union type. |
| opencti-platform/opencti-dev/opensearch/Dockerfile | Builds an OpenSearch image with the ingest-attachment plugin installed. |
| opencti-platform/opencti-dev/docker-compose.yml | Switches OpenSearch service to build: the new Dockerfile and updates usage hint. |
| const attachmentAttributes = attributes[18]; | ||
|
|
||
| type AttachmentAttributeMappingNames = typeof attachmentAttributes.mappings[number]['name'][]; | ||
|
|
There was a problem hiding this comment.
attributes[18] is a fragile way to reference the attachment attribute. If the attributes array order changes, this will silently point at the wrong element and can cause a runtime crash when accessing .mappings (or make the type assertion validate the wrong mapping). Prefer selecting the attachment definition by name === 'attachment' (and failing fast if not found) to make this resilient to reordering.
| const attachmentAttributes = attributes[18]; | |
| type AttachmentAttributeMappingNames = typeof attachmentAttributes.mappings[number]['name'][]; | |
| const attachmentAttributes = attributes.find((attribute) => attribute.name === 'attachment'); | |
| if (!attachmentAttributes || !attachmentAttributes.mappings) { | |
| throw new Error('Attachment attribute definition with mappings not found in internal file attributes'); | |
| } | |
| type AttachmentAttributeMappingNames = (typeof attachmentAttributes.mappings)[number]['name'][]; |
| remove_binary: true, | ||
| properties: ATTACHMENT_PROCESSOR_EXTRACTED_PROPS_ELASTICSEARCH as Mutable<typeof ATTACHMENT_PROCESSOR_EXTRACTED_PROPS_ELASTICSEARCH>, | ||
| }, |
There was a problem hiding this comment.
The properties value is being forced to a mutable array via a type cast. Instead of as Mutable<...>, consider creating a real mutable array (e.g., by spreading into a new array) so the value’s runtime shape matches the declared type and you can avoid unsafe assertions.
| // Union type of all properties extracted by the ES or OS attachment processor | ||
| export type AttachmentProcessorExtractedProp = Mutable<typeof ATTACHMENT_PROCESSOR_EXTRACTED_PROPS_ELASTICSEARCH>[number] | ||
| | Mutable<typeof ATTACHMENT_PROCESSOR_EXTRACTED_PROPS_OPENSEARCH>[number]; |
There was a problem hiding this comment.
Using Mutable<typeof ...>[number] to extract the union of literal values is unnecessarily complex for const arrays/tuples. typeof ATTACHMENT_PROCESSOR_EXTRACTED_PROPS_ELASTICSEARCH[number] (and similarly for OpenSearch) should produce the same union type and makes the intent clearer.
|
|
||
| # Disabled by default, to run use: | ||
| # docker compose --profile opensearch up -d | ||
| # podman compose --profile opensearch up -d |
There was a problem hiding this comment.
The usage hint here was changed to podman compose, but the rest of this file still documents docker compose for other profiles. If Podman is not required, keep docker compose (or mention both) to avoid confusing contributors following the dev setup instructions.
| # podman compose --profile opensearch up -d | |
| # docker compose --profile opensearch up -d | |
| # or: podman compose --profile opensearch up -d |
Context
The issue arose because of missing index mappings in the attachment sub-document: we use an Elasticsearch pipeline processor for attachments that extracts fields for us. By default this processor extracts all the fields it can: https://www.elastic.co/guide/en/elasticsearch/reference/8.19/attachment.html#attachment-fields.
The problem is that we've created index mapping sfor only a subset of those fields (see
document.ts). This added to the fact that we enforcedynamic: strictbehavior on indices, meaning we don't let unknown fields be pushed on an index, resulted in a few exceptions.Proposed changes
This PR specifies which fields to extract when configuring the
attachmentpipeline in ES/OS: those for which we already have a mapping.We could consider ingesting the other pieces of data but I'm not sure it's useful and the volume is ultra low for now.
I added an integration test making sur that a PDF with a metadata (
dc:publisherordc:rating) that could be extracted by the processor, but that isn't because we now tell it not to, doesn't fail the indexing.I ran the test with ES and OS locally. Running with OS required tweaking the dev setup to install the
ingest-attachmentplugin before starting the OS process which requires a custom Dockerfile (https://docs.opensearch.org/latest/install-and-configure/install-opensearch/docker/#working-with-plugins).Related issues
Checklist
Further comments
I had to track down how to name the metadata by looking at the ES code (https://github.com/elastic/elasticsearch/blob/main/modules/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java#L200) and the library itself uses (Apache TIka). Without the
dc:prefix it wouldn't be seen by the processor.I used https://www.embedpdf.com/tools/pdf-metadata-editor to add metadata to the test file and tika to read them like the ES dependency.